Abstract

The ability to classify spoken speech based on the style of speaking is an important
problem. With the advent of BPOs in recent times, specifically those that cater to
a population other than the local population, it has become necessary for BPOs to
identify people with a certain style of speaking (American, British, etc.). Today BPOs
employ accent analysts to identify people having the required style of speaking. This
process is subject to human bias and is becoming increasingly infeasible because
of the high attrition rate in the BPO industry. In this paper, we propose a new
metric that helps classify spoken speech robustly and accurately based on the style
of speaking. The role of the proposed metric is substantiated by using it to classify
real speech data collected from over seventy different people working in a BPO. We
compare the performance of the metric against human experts who independently
carried out the classification process. Experimental results show that the system
using the novel metric agrees with each of two human experts more closely than the
two experts agree with each other.


1 Introduction 

Business Process Outsourcing (BPO) centers are becoming increasingly common because
of increased quality consciousness, particularly in the service industry segment.
Developments in telecommunications make it feasible for BPOs to be located in
regions other than those of the population they service. In addition, socio-economic
reasons justify locating BPOs anywhere, without the people being serviced being
aware of it. This has led to a spate of BPOs cropping up in developing countries
where there exists a large population that can speak the language of the people being
serviced, though not necessarily in the same style. For this reason, there is no definite
recruitment qualification that one should possess to join a BPO, except that one be
able to speak in the style of the population that the BPO services. The increase in the
number of BPOs and the absence of a specific qualification requirement lead to a state
of constant flux: people are always on the move (high attrition). This leads to the
requirement of a constant recruitment process at the BPOs. Today BPOs, without
exception, employ accent analysts to select candidates. The accent analyst judges the
suitability of a candidate by analyzing the speaking style of the candidate. The process
of recruitment is time consuming (on average only about 7% of the candidates appearing
for the interview are selected) and is prone to human bias. There is a need for an
automatic system that can measure the candidate's speaking style or, more precisely,
classify the candidate's speaking style as being suitable (good), trainable (average)
or unsuitable (bad).

Often one is able to make out the speaker's background (American, British, Indian, etc.)
by just listening to the spoken speech of the person. In addition, one is also able to tell if
the person is speaking well or not, even in the absence of knowledge of the language being
spoken. Thus, it is possible for a human to categorize speakers based on their speaking style
by listening to their speech. A trained human is able to perform this classification task
better because he or she is aware of the nuances that identify well-spoken speech.
An ideal system^1 would be one that has the ability to classify people based on their
speaking style by analyzing their free-spoken speech. While work is ongoing at the
Cognitive Systems Research Laboratory of Tata Infotech, the development of such a
system is still at an early stage.

In this paper, we propose a system that can be used to classify people based on their
speaking style^2. The heart of the system is a new metric that captures the speaking
style of a person. We describe the construction and use of this metric, and aim at
developing a system that is able to categorize the speaking style of a person by
analyzing a predetermined set of words and sentences^3.

2 Metric to Classify Spoken Speech

The speaking style and articulatory capability of spoken speech can be assessed
automatically by comparing the test samples with ideal samples using a metric,

    $\mathcal{D} = (\mathbb{D}_{ij}, \mathbb{P}_{ij}, \mathbb{R}_{ij})$

The metric $\mathcal{D}$ is suitable for comparing two spoken words or sentences, i and j.
Note that the metric $\mathcal{D}$ captures both the articulatory ($\mathbb{D}_{ij}$) and the
intonation ($\mathbb{P}_{ij}$ and $\mathbb{R}_{ij}$) capability of the speaker, both of which
together characterise the speaking style of the spoken speech. While $\mathbb{D}_{ij}$
captures the closeness of the content of the two spoken words or sentences,
$\mathbb{P}_{ij}$ and $\mathbb{R}_{ij}$ capture the closeness in terms of intonation or the
style of speaking. Note that $\mathbb{P}_{ij}$ depends on the parameter pitch while the
measure $\mathbb{R}_{ij}$ depends on the stress of the spoken speech.

^1 Essentially the system would be built by first analyzing and deriving rules by listening to spoken
speech samples. These rules would enable development of the system to determine the quality of speech.

^2 To assist the BPO recruitment process.

^3 The speaker would be asked to speak a carefully selected list of words and sentences, which would be
used by the system to analyze the style of speaking.


3 Problem Formulation 

3.1 Selection of Ideals 

The reference speech samples are analysed qualitatively and assigned a quantitative
measure based on the metric (triplet) discussed in Section 2. Suppose the reference speech
data is collected for W predefined words (or sentences) from Q classification groups (for
example, Very Good, Good, Average, Bad, Very Bad) of people. Assume that each group
g has $N_g$ persons in it. Selection of ideals begins by segregating the spoken speech
samples into the W predefined words (or sentences) and the Q groups (or categories) of
speaking style. For each $w \in W$ and $g \in Q$, determine $\mathcal{D}_{wg}$, the
average over all the utterances by the $N_g$ different persons. This produces a set of
measurements $\{\mathcal{D}_{wg}\}$ that represents all the words and groups, using the
pseudo code described in Algorithm 1.


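Algorithm 1 Computing the metric $\mathcal{D}_{ij}$ for the reference speech data.

for i = 1 : W do
    for j = 1 : Q do
        for k = 1 : N_g do
            for l = 1 : N_g do
                Calculate $\mathcal{D}_{ij}^{kl}$, the metric between utterances k and l
            end for
        end for
        $\mathcal{D}_{ij} = \frac{1}{(N_g)^2} \sum_{k=1}^{N_g} \sum_{l=1}^{N_g} \mathcal{D}_{ij}^{kl}$
    end for
end for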


A single reference speech sample $R_{ij}$ is chosen for each word $i = 1, \dots, W$ and for
each group $j = 1, \dots, Q$ if the variation within that word-group category is not larger
than a predefined threshold. Otherwise several (in the worst case all) reference speech
samples in the word-group category are chosen. The estimate of $\mathcal{D}_{ij}$ helps in
identifying $R_{ij}$ for each word-group category. In essence, the $R_{ij}$'s (one or several)
are the chosen representatives of all the speech samples in the $i$-th word and $j$-th
group category.

Note that for each word i and each group j there is a reference speech sample
set $R_{ij}$ (in the extreme case, this $R_{ij}$ could be multiple files encompassing all the
reference speech files in that word-group category) and a score $\mathcal{D}_{ij}$.
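
The averaging in Algorithm 1 and the selection rule above can be sketched in Python for a
single word-group category. This is a minimal sketch under stated assumptions, not the
system's implementation: the paper does not say how two utterances are compared, so
`compare` is a hypothetical callable returning a (D, P, R) triplet; "variation" is assumed
here to be the standard deviation of the pairwise articulation scores; and the single
representative is assumed to be the sample closest on average to the others.

```python
from itertools import product
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

Triplet = Tuple[float, float, float]  # (D, P, R) scores for one pair of utterances


def group_average(samples: Sequence, compare: Callable[..., Triplet]) -> Triplet:
    """Average the triplet over all Ng * Ng ordered pairs, as in Algorithm 1."""
    pairs = [compare(a, b) for a, b in product(samples, repeat=2)]
    n = len(pairs)  # equals Ng squared
    return (
        sum(p[0] for p in pairs) / n,
        sum(p[1] for p in pairs) / n,
        sum(p[2] for p in pairs) / n,
    )


def select_ideals(samples: Sequence, compare: Callable[..., Triplet],
                  threshold: float) -> List:
    """Return one representative if the category is homogeneous, else all samples."""
    # Assumption: variation = population std. dev. of pairwise articulation scores.
    d_scores = [compare(a, b)[0] for a, b in product(samples, repeat=2)]
    if pstdev(d_scores) <= threshold:
        # Assumption: the representative has the smallest mean distance to the rest.
        best = min(samples, key=lambda a: mean(compare(a, b)[0] for b in samples))
        return [best]
    return list(samples)


def toy_compare(a: float, b: float) -> Triplet:
    """Hypothetical stand-in comparator: samples are plain numbers, not audio."""
    return (abs(a - b), 1.0, 1.0)


if __name__ == "__main__":
    group = [1.0, 2.0, 3.0]
    print(group_average(group, toy_compare))
    print(select_ideals(group, toy_compare, threshold=10.0))
    print(select_ideals(group, toy_compare, threshold=0.1))
```

In a real setting `compare` would operate on speech recordings; the toy comparator only
exercises the control flow.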
3.2 Classification based on Ideals

Given a test speech sample t and Q category groups, the problem is one of tagging the
test speech sample to one of the Q groups based on its closeness to all the ideals in all
the Q groups, measured under one of two norms. All our experiments use the same norm.

It is assumed that the content of the test speech sample t is known (meaning that the
word or the sentence that has been spoken is known) and is x. Now, we compare the test
speech sample with the reference speech samples $R_{xj}$ for $j = 1, \dots, Q$ and calculate
the triplet scores $\mathcal{D}_{tj} = (\mathbb{D}_{tj}, \mathbb{P}_{tj}, \mathbb{R}_{tj})$. Note that
$\mathcal{D}_{tj}$ is obtained by comparing the test sample t with all the ideals in all the
Q groups and then choosing the minimum $\mathcal{D}_{tj}$. The test speech sample t is
classified as belonging to the group g if the criteria

    $\mathbb{D}_{tg} < \mathbb{D}_{tj}$
    $\mathbb{P}_{tg} > \mathbb{P}_{tj}$
    $\mathbb{R}_{tg} > \mathbb{R}_{tj}$

are satisfied $\forall j = 1, \dots, Q$ and $j \neq g$.
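
The decision rule above can be sketched in Python as follows. This is an illustrative
sketch, assuming the best triplet of the test sample against each group's ideals has
already been computed; the function name `classify` and the dictionary layout are
hypothetical. The paper does not say what happens when no group satisfies all three
inequalities, so returning `None` in that case is an assumption.

```python
from typing import Dict, Optional, Tuple

Triplet = Tuple[float, float, float]  # (D, P, R) of the test sample against one group


def classify(scores: Dict[str, Triplet]) -> Optional[str]:
    """Return the group g whose triplet beats every other group on all components.

    Group g wins when its articulation score D is strictly smaller and its
    intonation scores P and R are strictly larger than those of every other group.
    """
    for g, (d_g, p_g, r_g) in scores.items():
        if all(
            d_g < d_j and p_g > p_j and r_g > r_j
            for j, (d_j, p_j, r_j) in scores.items()
            if j != g
        ):
            return g
    return None  # assumption: no decision when no group dominates


if __name__ == "__main__":
    example = {
        "good": (0.2, 0.9, 0.8),
        "average": (0.5, 0.6, 0.5),
        "bad": (0.9, 0.2, 0.1),
    }
    print(classify(example))
```

Because the three inequalities must hold simultaneously, at most one group can satisfy
them, so the order in which groups are examined does not affect the result.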

4 Experimental Results 

A set of 20 words and 10 sentences was selected in consultation with phoneticians and
accent training experts. The set consisted of words and sentences that are very commonly
prone to pronunciation error, and in some cases the words were tongue twisters. The chosen
set is deemed capable of assessing the development of articulation of a person.
Data was collected from a set of 20 people in each category (very good, good, average,
bad and very bad speaking style). All persons were asked to speak the predetermined set
of 20 words and 10 sentences on the telephone using an IVR application custom built for
collecting data. The speech data was tagged separately by two accent experts into one of
the five (very good, good, average, bad, very bad) categories. Table 1 gives the agreement
between the two human accent experts. Total agreement is when both human experts
place the same speech sample in the same category (for example, both experts say
that the speech sample is good), and 1-step agreement corresponds to the human experts
differing in their categorization by a distance of 1 category (for example, one
expert says that the speech sample is good while the other says that the speech sample is
very good or average).
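
The two agreement measures can be computed as in the following Python sketch. The function
name `agreement` and the label strings are illustrative assumptions. The sketch counts
1-step agreement cumulatively (labels at most one category apart), an interpretation
suggested by Table 2, where 1-step agreement reaches 100 % while total agreement is 56 %.

```python
from typing import Sequence, Tuple

# Categories ordered from best to worst, as listed in the text.
CATEGORIES = ["very good", "good", "average", "bad", "very bad"]


def agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> Tuple[float, float]:
    """Return (total, one_step) agreement between two raters, in percent."""
    rank = {category: index for index, category in enumerate(CATEGORIES)}
    diffs = [abs(rank[a] - rank[b]) for a, b in zip(labels_a, labels_b)]
    n = len(diffs)
    total = 100.0 * sum(d == 0 for d in diffs) / n
    # Assumption: 1-step agreement includes exact matches (distance of at most 1).
    one_step = 100.0 * sum(d <= 1 for d in diffs) / n
    return total, one_step


if __name__ == "__main__":
    expert_1 = ["good", "good", "bad", "average"]
    expert_2 = ["good", "very good", "very bad", "very bad"]
    print(agreement(expert_1, expert_2))
```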

For the purpose of evaluating our system, we divided the speech data
into 3 parts. We used 2 parts of the data corresponding to each of the 5 categories
to select the ideals and used the remaining part to test the performance of the system.
The overall performance of the system for classifying spoken speech is tabulated in Table 2.
The agreement between the system and each human expert (see Table 2) is higher than
the agreement between the two human experts (see Table 1).
Expert 1 - Expert 2

Total Agreement 
1-step Agreement 

to 


Table 1: Agreement between two human experts. 



                     Expert 1 - System    Expert 2 - System
Total Agreement      56 %                 47 %
1-step Agreement     100 %                90 %


Table 2: Agreement between the system and the two human experts. 

5 Conclusions 

With the increase in BPOs there is a need for an automatic speaking style analyser.
Speaking style analysis by human experts is bound to be biased by cues that might not
necessarily be associated with the speaking style, and the judgement of the speaking
style depends on the individual expert. To overcome this bias, we have developed a system
that automatically analyses the speaking style of a person. We proposed a metric which
captures both the articulatory capability and the intonation of the speaker, both of which
jointly characterise the speaking style of the person. Experimental results show that the
agreement between the system and each human expert exceeds the agreement between
the two independent human experts.